Home Credit Default Risk

Group 23

TEAM AND PROJECT META INFORMATION

Email IDs:

Raj Chavan: rchavan@iu.edu

Sanket Bailmare: sbailmar@iu.edu

Shefali Luley: sluley@iu.edu

Tanay Kulkarni: tankulk@iu.edu

Group: 23

Members: Raj Chavan, Sanket Bailmare, Shefali Luley, Tanay Kulkarni


PROJECT ABSTRACT

In today’s world, many people struggle to get loans because of insufficient or non-existent credit histories, which often drives them toward untrustworthy lenders who take advantage of them. Home Credit works to expand financial inclusion for the unbanked by providing a secure borrowing experience. It uses a variety of alternative data, such as clients' background information and transactional records, to predict their repayment ability. To ensure that this underserved demographic has a favorable loan experience, we use machine learning and statistical methods to make these predictions, so that clients who are capable of repayment are granted a loan rather than rejected. They are also given a loan maturity plan and a repayment calendar that positions them to succeed.

In the previous stage, we studied the problem and the data and devised our plan of action. Our goal in this phase is to clean and preprocess the Home Credit Default Risk (HCDR) data and build the baseline pipeline. We begin by understanding the data: what kind of data we have been given, and how many rows and columns it contains in what form, using functions that compute descriptive statistics for each column. We then clean the data by removing columns with more than 50% NaN (null) values; data cleaning is an important step, as it helps us focus on what we want to achieve. Next we apply preprocessing steps, which include (but are not limited to) computing correlations, dropping columns, computing means and medians, and dividing the data into numerical and categorical features. We create one baseline pipeline each for the numerical and categorical features and merge them with a third pipeline, then use the resulting data to train our model. We also perform visual EDA to draw further useful insights from the data.

PROJECT DESCRIPTION

Home Credit Default Risk is a dataset provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. It consists of the following files:

application.csv: The main dataset, divided into train and test sets. It contains information about each loan and the applicant at application time.

bureau.csv: Clients' previous loans from other institutions that were reported to the Credit Bureau, with one row per loan.

bureau_balance.csv: Monthly balances of the earlier credits in the Credit Bureau.

previous_application.csv: The applicant's previous loans at Home Credit, including the parameters of those loans and the client's information at that time.

POS_CASH_balance.csv: Monthly balance snapshots of the previous point-of-sale (POS) and cash loans that the applicant had with Home Credit.

installments_payments.csv: Clients' payment history for each installment of the earlier Home Credit credits related to the loans in our sample.

credit_card_balance.csv: Monthly balance snapshots of clients' previous credit cards with Home Credit.

Dataset link: https://www.kaggle.com/c/home-credit-default-risk/data

The task to be tackled: predicting the probability that a loan applicant will default on repayment (the TARGET variable).

Diagram: a block diagram illustrating the workflow of the data.

EXPLORATORY DATA ANALYSIS + FEATURE ENGINEERING AND TRANSFORMERS

Data description:

The df.info() function describes the dataset: the number of rows and columns, and the data type of each column.

The df_test.describe() function gives summary statistics and valuable information such as count, mean, minimum, and maximum.
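A minimal sketch of these inspection calls, assuming the standard Kaggle file names for the train and test splits:

```python
import pandas as pd

# Load the train and test splits (standard Kaggle file names, an assumption)
df = pd.read_csv("application_train.csv")
df_test = pd.read_csv("application_test.csv")

df.info()                  # row/column counts, non-null counts, dtypes
print(df_test.describe())  # count, mean, std, min, quartiles, max per numeric column
```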

Feature Engineering:

The first step in dealing with the data was to remove columns that are redundant and would not contribute to prediction. We explored the data and counted the missing values, removing the columns with more than 50% missing values. We also checked the distribution of zeros and removed columns in which 90% of the rows contained only zeros. Furthermore, we divided the data into numerical and categorical features. The numerical data was handled with an intermediate imputer pipeline that replaces missing values with the column mean, while missing categorical data was handled by one-hot encoding (OHE) and replacing missing values with the mode of each column.

Declaring some functions

Columns with more than 90% zero values in them

Dropping the above columns from the dataset
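A sketch of how this check and drop could be implemented, assuming df and df_test are the dataframes loaded above:

```python
# Fraction of zero values per numeric column; TARGET is excluded as the label
numeric = df.select_dtypes(include="number").drop(columns=["TARGET"])
zero_frac = (numeric == 0).mean()

# Columns where at least 90% of the rows are zero
mostly_zero = zero_frac[zero_frac >= 0.90].index.tolist()

# Drop them from both splits
df = df.drop(columns=mostly_zero)
df_test = df_test.drop(columns=[c for c in mostly_zero if c in df_test.columns])
```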

Columns in the training set with more than 30% missing data, along with their median/mode and the unique values in each column
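One way this summary could be produced (a sketch; the 30% threshold and the median/mode reporting follow the text):

```python
import pandas as pd  # df assumed loaded as above

# Share of missing values per column; report those above the 30% threshold
miss_frac = df.isna().mean()
for col in miss_frac[miss_frac > 0.30].index:
    if pd.api.types.is_numeric_dtype(df[col]):
        print(f"{col}: {miss_frac[col]:.0%} missing, median={df[col].median()}")
    else:
        print(f"{col}: {miss_frac[col]:.0%} missing, "
              f"mode={df[col].mode().iat[0]}, {df[col].nunique()} unique values")
```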

Segregating the dataset into numerical and categorical dataframes
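A sketch of this split, treating object-typed columns as categorical:

```python
# Object-typed columns are treated as categorical; the rest as numerical
num_df = df.select_dtypes(include="number")
cat_df = df.select_dtypes(include="object")
```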

Numerical Columns and their correlation with the TARGET column in descending order
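A sketch of that correlation listing, assuming num_df from the split above:

```python
# Correlation of every numerical column with the TARGET label, descending
target_corr = (num_df.drop(columns=["TARGET"])
                     .corrwith(num_df["TARGET"])
                     .sort_values(ascending=False))
print(target_corr)
```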

The columns NAME_FAMILY_STATUS, CODE_GENDER, and NAME_INCOME_TYPE take the values 'Unknown', 'XNA', and 'Maternity leave' only in the training dataset, never in the test dataset, so the rows carrying these values are removed from the training dataset; a total of 11 rows are removed.
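A sketch of this filter; the value-to-column mapping shown is the one these categories take in the actual dataset:

```python
# Category values present in train but absent from test; dropping them
# removes 11 rows in total
rare = {"NAME_FAMILY_STATUS": "Unknown",
        "CODE_GENDER": "XNA",
        "NAME_INCOME_TYPE": "Maternity leave"}
for col, val in rare.items():
    df = df[df[col] != val]
```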

Considering Columns that have more than 2% correlation with the TARGET variable
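The selection itself reduces to a one-liner over the correlation series computed above:

```python
# Keep numerical columns whose absolute correlation with TARGET exceeds 0.02
selected_num_cols = target_corr[target_corr.abs() > 0.02].index.tolist()
```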

Out of these columns, we first need to deal with missing values, so we list all columns that have them.

In cols_with_no_missing we keep the columns that have no missing values.

We set a threshold of 10 unique values per column: columns with 10 or fewer unique values are treated as discrete rather than numerical, and their missing values are replaced with the mode of the column.
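A sketch of this discrete-column imputation, assuming selected_num_cols from the previous step:

```python
# Columns with 10 or fewer unique values are treated as discrete:
# fill their missing entries with the column mode instead of the mean
for col in selected_num_cols:
    if df[col].nunique() <= 10:
        df[col] = df[col].fillna(df[col].mode().iat[0])
```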

Finally, we gather the names of all numerical columns that we will use in imputation and modeling.

This gives us the final numerical dataframe

Here we compute the missing values for the categorical columns; out of 16 columns, we find 6 that have missing values.

Getting the counts of each category in these columns
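A sketch of how these counts and missing shares could be printed:

```python
# Missing share and per-category counts for categorical columns with gaps
cat_missing = cat_df.isna().mean()
for col in cat_missing[cat_missing > 0].index:
    print(f"{col}: {cat_missing[col]:.0%} missing")
    print(cat_df[col].value_counts(), "\n")
```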

Here we observe that for the following columns:

1) FONDKAPREMONT_MODE has 68% missing data, which makes imputing the missing values with the mode inadvisable.

2) WALLSMATERIAL_MODE: the difference between the first and second most frequent values is small, so imputation with the mode would be ambiguous.

3) OCCUPATION_TYPE has 31% missing values, and again no single value can reasonably be chosen to impute the missing data.

Thus we remove these three columns from the categorical part of the dataframe.
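The drop itself, as a sketch:

```python
# Drop the three categorical columns where mode imputation is unreliable
cat_df = cat_df.drop(columns=["FONDKAPREMONT_MODE",
                              "WALLSMATERIAL_MODE",
                              "OCCUPATION_TYPE"])
```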

VISUAL EXPLORATORY DATA ANALYSIS

Visualizing the categorical columns to understand the data better

How are loans distributed according to gender?

Inference: The number of women who borrowed and have not repaid their loans is comparatively higher than that of men.

What is the marital status of the clients?

Inference: Married clients form the majority of borrowers and account for the largest number of unpaid loans, while the 'Unknown' status is negligible.

What percentage of clients own a car?

Inference: About 50% of clients own a car, but slightly more than 50% do not, and most of the clients who have not repaid their loans fall into the latter group.

What type of educational background do the clients have?

Inference: Clients with an Academic Degree are more likely to repay the loan than others.

What types of housing do the clients stay in?

Inference: From the graphical presentation above, we can see that the majority of clients live in a house or apartment and this group accounts for most of the unpaid loans, while the counts for office apartments and co-op apartments are negligible.

What income types do the loan applicants have?

Inference: The Student and Businessman categories are negligible; here we can see that working clients account for the majority of unpaid loans.

On which days of the week are loan applications processed?

Inference: Loan application counts peak on Tuesday, while the lowest counts clearly fall on the weekends.

What types of loan are available?

Inference: Far more clients take cash loans than revolving loans.

Here we plot some graphs for the columns with the highest correlations with the TARGET variable and observe their trends with respect to it.

Inference: For the columns EXT_SOURCE_3, EXT_SOURCE_2, and EXT_SOURCE_1 we can observe a clear, strong negative correlation with the target.

What type of correlation do the columns DAYS_BIRTH and DAYS_LAST_PHONE_CHANGE have with respect to the target?

Inference: From the plots above, we can see that DAYS_BIRTH and DAYS_LAST_PHONE_CHANGE show a strong positive correlation with respect to the target.

MODELING PIPELINES


In this project, we create three pipelines: one for numerical data, one for categorical data, and a final pipeline that combines them.

(i) Numerical data pipeline: in the pipeline for numerical data, called ‘num_pipeline’, we impute missing values with the mean of each column.

(ii) Categorical data pipeline: in the pipeline for categorical data, called ‘cat_pipeline’, we impute missing values with the mode, i.e. the most frequent value.

(iii) Final pipeline: we create a pipeline that merges the numerical and categorical columns, which now have no missing values. The categorical columns are also one-hot encoded.

Importing necessary packages

Selecting only the columns that we finally decided on for the numerical and categorical parts.

Making two pipelines: one for the numerical data, where we impute missing values with the column mean, and one for the categorical data, where we impute missing values with the mode, i.e. the most frequent value.

Here we create a pipeline that merges the numerical and categorical columns, which now have no missing values; the categorical columns are one-hot encoded.
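A sketch of the three pipelines with scikit-learn; selected_num_cols and selected_cat_cols stand in for the final numerical and categorical column lists chosen earlier (the latter name is ours, not from the notebook):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Numerical branch: replace missing values with the column mean
num_pipeline = Pipeline([("imputer", SimpleImputer(strategy="mean"))])

# Categorical branch: replace missing values with the mode, then one-hot encode
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Final pipeline: merge both branches into a single feature matrix
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, selected_num_cols),
    ("cat", cat_pipeline, selected_cat_cols),  # final categorical column list
])

X_train = full_pipeline.fit_transform(df)
y_train = df["TARGET"]
```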

This is the final dataset that we get for training our model

RESULTS AND DISCUSSION OF RESULTS

Here we use the lbfgs solver, a limited-memory BFGS (L-BFGS or LM-BFGS) optimization algorithm in the family of quasi-Newton methods, which approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory.

Here we see that the Random Forest classifier has the best training accuracy at around 99%, but accuracy this high runs a risk of overfitting. The logistic regression model gives an accuracy of 91%, which is decent and makes it a model worth considering. The ROC AUC values for the Random Forest and logistic regression models are 0.704 and 0.735 respectively, and both show a significant number of true positives, which suggests a good model fit. We cannot consider Naive Bayes as our model, as it evidently underfits the data. We still need to check how the Random Forest and logistic regression models perform on the test data to confirm whether Random Forest is really overfitting.
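A hedged sketch of how the three models could be fit and compared; the exact training setup (train/validation split, hyperparameters) is assumed, not taken from the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

# GaussianNB needs a dense matrix; one-hot encoding may yield a sparse one
X_dense = X_train.toarray() if hasattr(X_train, "toarray") else X_train

models = {
    "Logistic Regression": LogisticRegression(solver="lbfgs", max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_dense, y_train)
    acc = model.score(X_dense, y_train)
    auc = roc_auc_score(y_train, model.predict_proba(X_dense)[:, 1])
    print(f"{name}: training accuracy={acc:.3f}, ROC AUC={auc:.3f}")
```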

For Kaggle submission

In the following pipeline we use a standard scaler to normalize the data to zero mean and unit standard deviation, and use logistic regression as our modeling algorithm, with the lbfgs solver and 1000 iterations allowed for convergence.
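A sketch of that pipeline, reusing X_dense and y_train from the comparison above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale features to zero mean and unit variance (assumes a dense matrix),
# then fit logistic regression with the lbfgs solver and up to 1000 iterations
model_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="lbfgs", max_iter=1000)),
])
model_pipeline.fit(X_dense, y_train)
```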

Test Dataset

Here we derive the same final columns for the test dataset as for the training dataset.
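A sketch of generating the submission file; SK_ID_CURR is the applicant id column in the Kaggle data, and full_pipeline / model_pipeline are the fitted objects from above:

```python
import pandas as pd  # df_test assumed to carry the same selected columns

# Apply the fitted preprocessing and model to the test split
X_test = full_pipeline.transform(df_test)
X_test = X_test.toarray() if hasattr(X_test, "toarray") else X_test
probs = model_pipeline.predict_proba(X_test)[:, 1]

# One row per applicant: id plus predicted default probability
submission = pd.DataFrame({"SK_ID_CURR": df_test["SK_ID_CURR"], "TARGET": probs})
submission.to_csv("submission.csv", index=False)
```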

Kaggle Test Accuracy

For Logistic Regression

[Kaggle submission score screenshot]

For Random Forest Classifier

[Kaggle submission score screenshot]

With a high training accuracy but a low test accuracy, the Random Forest classifier appears to be overfitting, so the ideal choice of model is logistic regression.

CONCLUSION

In this phase, we began working with the dataset in earnest. This included understanding the dataset and the information we are presented with, then cleaning the data accordingly. We kept only the features that mattered for the target variable and prediction, engineered features by performing OHE, and applied imputation methods to fix the data before feeding it to the models. We built the baseline pipeline and experimentally compared the accuracies of logistic regression, Naive Bayes, and Random Forest. Based on the results, Naive Bayes appears to underfit and Random Forest to overfit. The best model we obtained for Phase 0 was logistic regression, with a training accuracy of 91.9% and a Kaggle submission score of 73.6%. Going forward, we plan to improve the feature engineering and perform hyperparameter tuning for our models using K-fold cross-validation and GridSearchCV; we may also use advanced gradient boosting models to get as close to the best achievable score as we can. After the aforementioned steps, we also plan to apply deep learning techniques such as artificial neural networks for better prediction results.

References: https://www.kaggle.com/c/home-credit-default-risk/data